It has been conjectured that evolution exerted pressure to preserve
amino acids bearing thermodynamic, kinetic, and functional roles. In
this letter we show that the physical requirement to maintain protein
stability gives rise to a sequence conservatism pattern that is in
remarkable agreement with that found in natural proteins. Based on the
physical properties of amino acids, we propose a model of evolution
that explains conserved amino acids across protein families sharing
the same fold.

Molecular evolution is a sumptuous natural laboratory that provides an
invaluable source of information about the structure, dynamics, and
function(s) of biomolecules. This information has already been widely
used to understand the folding kinetics, thermodynamics, and function
of proteins. The basic belief behind the majority of such studies is
that evolution optimizes, to a certain extent, the properties of
proteins, so that they become sufficiently stable and have better
folding and functional properties.

Recent studies identified positions in several common protein folds
where amino acids are universally conserved within each family of
proteins having that fold. Such positions are localized in structure,
and their unusually strong conservatism may be due to functional
reasons (e.g. a super-site) or to folding kinetics (a folding
nucleus). In contrast to function and folding kinetics, evolutionary
pressure to maintain stability may be applied "more evenly" because
all amino acids contribute, to a lesser or greater extent, to protein
stability via their interactions with other amino acid residues and
with water.

In this letter we develop a model that provides a rationale for
conservatism patterns caused by selection for stability. Our model is
one of equilibrium evolution that maintains stability and other
properties achieved at an earlier, prebiotic stage. To this end we
propose that stability selection accepts only those mutations that
keep the energy of the native protein, $E$, below a certain threshold
$E_0$ necessary to maintain an energy gap. The requirement to maintain
an energy threshold for the viable sequences makes the equilibrium
ensemble of sequences analogous to a microcanonical ensemble. In
analogy with statistical mechanics, a more convenient and realistic
description of the sequence ensemble is a canonical ensemble, whereby
the strict requirement on the energy of the native state is replaced
by a "soft" evolutionary pressure that allows energy fluctuations from
sequence to sequence but makes sequences with high energy in the
native state unlikely. In the canonical ensemble of sequences, the
probability of finding a particular sequence, $\{\sigma\}$, in the
ensemble follows the Boltzmann distribution

$$P(\{\sigma\}) = \frac{\exp(-E\{\sigma\}/T)}{Z}, \qquad (1)$$

where $T$ is the effective temperature of the canonical ensemble of
sequences that serves as a measure of evolutionary pressure, and
$Z = \sum_{\{\sigma\}} \exp(-E\{\sigma\}/T)$ is the partition function
taken in sequence space.
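As an illustration of the canonical ensemble of sequences, the following minimal sketch enumerates every sequence of a toy two-letter alphabet on a fixed set of native contacts and assigns each one its Boltzmann probability. The alphabet, the interaction values in `U_TOY`, the contact list `CONTACTS_TOY`, and the function names are hypothetical choices made for this example only; they are not taken from the model's actual parameters.

```python
import itertools
import math

def sequence_energy(seq, contacts, U):
    """Contact energy of a sequence threaded onto a fixed native structure."""
    return sum(U[seq[i]][seq[j]] for i, j in contacts)

def canonical_ensemble(length, alphabet, contacts, U, T):
    """Return {sequence: probability} with weights exp(-E/T) / Z."""
    weights = {}
    for seq in itertools.product(alphabet, repeat=length):
        weights[seq] = math.exp(-sequence_energy(seq, contacts, U) / T)
    Z = sum(weights.values())  # partition function taken in sequence space
    return {seq: w / Z for seq, w in weights.items()}

# Hypothetical toy parameters: only H-H contacts are favorable.
U_TOY = {"H": {"H": -1.0, "P": 0.0}, "P": {"H": 0.0, "P": 0.0}}
CONTACTS_TOY = [(0, 2), (1, 3), (0, 3)]  # hypothetical native contacts

ensemble = canonical_ensemble(4, "HP", CONTACTS_TOY, U_TOY, T=0.5)
best = max(ensemble, key=ensemble.get)  # most probable (lowest-energy) sequence
```

Lowering `T` in this toy concentrates the probability on the lowest-energy sequence, which is the sense in which the temperature measures evolutionary pressure.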

Next, we apply a mean-field approximation that replaces all
multiparticle interactions between amino acids by the interaction of
each amino acid with an effective field $\Phi$ acting on this amino
acid from the rest of the protein. This approximation presents
$P(\{\sigma\})$ in a multiplicative form, $\prod_{k=1}^{N} p(\sigma_k)$,
of probabilities to find an amino acid $\sigma$ at position $k$.
$p(\sigma_k)$ also obeys Boltzmann statistics:

$$p(\sigma_k) = \frac{\exp(-\Phi(\sigma_k)/T)}{\sum_{\sigma} \exp(-\Phi(\sigma)/T)}. \qquad (2)$$

The mean-field potential $\Phi(\sigma_k)$ is the effective potential
energy between amino acid $\sigma_k$ and all amino acids interacting
with it, i.e.

$$\Phi(\sigma_k) = \sum_{i \neq k}^{N} U(\sigma_k, \sigma_i)\,\Delta_{ki}. \qquad (3)$$

Matrix $U$ describes energy parameters in the contact approximation
and matrix $\Delta$ is the contact matrix of the protein native
structure (see Methods for more detail). The potential $\Phi$ is
similar in spirit to the protein profile introduced by Bowie et al.
[13] to identify protein sequences that fold into a specific 3D
structure.
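The mean-field step can be sketched as follows: for each position, every candidate amino acid is scored against the residues that contact that position in the native sequence, and the scores are turned into Boltzmann probabilities. This is a minimal sketch under stated assumptions; the two-letter alphabet, the values in `U_TOY`, and the function names are hypothetical and stand in for a real energy matrix and contact map.

```python
import math

def mean_field_potential(k, sigma, native_seq, contact_map, U):
    """Sum of U(sigma, sigma_i) over residues i in contact with position k."""
    return sum(
        U[sigma][native_seq[i]]
        for i in range(len(native_seq))
        if i != k and contact_map[k][i]
    )

def mean_field_probabilities(native_seq, contact_map, U, alphabet, T):
    """For each position k, return {sigma: p(sigma_k)} from Boltzmann weights."""
    profile = []
    for k in range(len(native_seq)):
        weights = {
            s: math.exp(-mean_field_potential(k, s, native_seq, contact_map, U) / T)
            for s in alphabet
        }
        z_k = sum(weights.values())  # single-position normalization
        profile.append({s: w / z_k for s, w in weights.items()})
    return profile

# Hypothetical toy parameters: only H-H contacts are favorable.
U_TOY = {"H": {"H": -1.0, "P": 0.0}, "P": {"H": 0.0, "P": 0.0}}
```

A position with no contacts feels no field, so all amino acid types are equally likely there; a position surrounded by favorable partners is predicted to be conserved.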

For each member, $m$, of the fold family (FSSP database) we compute
the mean-field probability, $p_m(\sigma_k)$, using Eq. (2), and then
we compute the average mean-field probability over the $N_s$ members,

$$P_{MF}(\sigma_k) = \frac{1}{N_s} \sum_{m=1}^{N_s} p_m(\sigma_k). \qquad (4)$$

Eqs. (2)-(4), along with a properly selected energy function, $U$,
make it possible to predict the probabilities of all amino acid types
and the sequence entropy $S_{MF}(k)$ at each position $k$,

$$S_{MF}(k) = -\sum_{\sigma} P_{MF}(\sigma_k) \ln P_{MF}(\sigma_k), \qquad (5)$$

from the native structure of a protein. The summation is taken over
all possible values of $\sigma$.
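The averaging over family members and the entropy calculation can be illustrated with a short sketch. It assumes that the per-member probability tables are already structurally aligned and of equal length, and it treats $0 \ln 0$ as zero; the function names are hypothetical.

```python
import math

def average_profile(profiles):
    """Average per-position probability tables over family members."""
    n_members = len(profiles)
    n_positions = len(profiles[0])
    averaged = []
    for k in range(n_positions):
        letters = profiles[0][k].keys()
        averaged.append(
            {s: sum(p[k][s] for p in profiles) / n_members for s in letters}
        )
    return averaged

def sequence_entropy(profile):
    """S(k) = -sum over sigma of P ln P at each position (0 ln 0 taken as 0)."""
    return [
        -sum(p * math.log(p) for p in column.values() if p > 0.0)
        for column in profile
    ]
```

A position where all members agree on one amino acid type has zero entropy, while a position split evenly between two types has entropy $\ln 2$.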

Theoretical predictions from statistical-mechanical analysis can be
compared with data on real proteins. In order to determine
conservatism in real proteins we note that the space of sequences that
fold into the same protein structure presents a two-tier system, where
homologous sequences are grouped into families and there is no
recognizable sequence homology between families despite the fact that
they fold into closely related structures.

Using the database of protein families with close sequence similarity
(HSSP database), we compute the frequencies of amino acids at each
position, $k$, of aligned sequences, $P_m(\sigma_k)$, for a given,
$m$th, family of proteins. We average these frequencies across all
$N_s$ families sharing the same fold that are present in the FSSP
database:

$$P_{acr}(\sigma_k) = \frac{1}{N_s} \sum_{m=1}^{N_s} P_m(\sigma_k). \qquad (6)$$

Next, we determine the sequence entropy, $S_{acr}(k)$, at each
position, $k$, of structurally aligned protein analogs:

$$S_{acr}(k) = -\sum_{\sigma} P_{acr}(\sigma_k) \ln P_{acr}(\sigma_k). \qquad (7)$$
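The observed conservatism profile can be sketched in the same spirit: count amino acid frequencies per position within each family, average the frequencies across families, and take the entropy. The sketch below assumes that all families are already placed on a common structural alignment and that gap characters are simply ignored when counting; both are simplifying assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter

def family_frequencies(aligned_seqs, alphabet, gap="-"):
    """Per-position amino acid frequencies within one family (gaps ignored)."""
    n_positions = len(aligned_seqs[0])
    table = []
    for k in range(n_positions):
        counts = Counter(seq[k] for seq in aligned_seqs if seq[k] != gap)
        total = sum(counts.values())
        table.append({s: (counts[s] / total if total else 0.0) for s in alphabet})
    return table

def across_family_entropy(families, alphabet):
    """Average frequencies over families, then return the entropy per position."""
    tables = [family_frequencies(f, alphabet) for f in families]
    n_families = len(tables)
    entropies = []
    for k in range(len(tables[0])):
        p_acr = [sum(t[k][s] for t in tables) / n_families for s in alphabet]
        entropies.append(-sum(p * math.log(p) for p in p_acr if p > 0.0))
    return entropies
```

Because each family contributes equally to the average, a large family does not dominate the profile, which reflects the two-tier structure of sequence space described above.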

If stability selection were a factor in the evolution of proteins and
our model captures it, then we should observe a correlation between
the predicted mean-field sequence entropies, $S_{MF}(k)$, and the
actual sequence entropies, $S_{acr}(k)$, in real proteins. Thus, the
question is: can we find a $T$ such that the predicted conservatism
profile $S_{MF}(k)$ matches the real one, $S_{acr}(k)$?

By varying the value of the temperature, $T$, in the range
$0.1 < T < 4.0$, we minimize the distance,
$D^2 = \sum_{k=1}^{N} (S_{MF}(k) - S_{acr}(k))^2$, between the
predicted and observed conservatism profiles. We exclude from this sum
positions in structurally aligned sequences that have more than 50%
gaps in the structural (FSSP) alignment. We denote by $T_{sel}$ the
temperature that minimizes $D$.
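The temperature fit can be illustrated with a simple grid search. The text does not state how the minimization was carried out, so the evenly spaced grid, the function names, and the toy predictor `toy_entropy` (independent two-state positions with hypothetical energy differences) are assumptions made for this sketch; in practice the predictor would be the mean-field entropy profile.

```python
import math

def temperature_grid(t_min=0.1, t_max=4.0, n_points=40):
    """Evenly spaced trial temperatures (the grid itself is an assumption)."""
    step = (t_max - t_min) / (n_points - 1)
    return [t_min + i * step for i in range(n_points)]

def squared_distance(s_mf, s_acr, gap_fraction, max_gap_fraction=0.5):
    """Squared distance summed over positions with at most 50% gaps."""
    return sum(
        (p - o) ** 2
        for p, o, g in zip(s_mf, s_acr, gap_fraction)
        if g <= max_gap_fraction
    )

def select_temperature(predict_entropy, s_acr, gap_fraction, grid=None):
    """Return (T_sel, squared distance at T_sel) by exhaustive grid search."""
    grid = temperature_grid() if grid is None else grid
    scored = [
        (squared_distance(predict_entropy(t), s_acr, gap_fraction), t)
        for t in grid
    ]
    d2, t_sel = min(scored)
    return t_sel, d2

# Hypothetical stand-in predictor: each position is a two-state system
# with energy difference d, so its entropy is a known function of T.
TOY_ENERGY_DIFFS = [0.5, 1.0, 2.0, 3.0]

def toy_entropy(t):
    profile = []
    for d in TOY_ENERGY_DIFFS:
        p = 1.0 / (1.0 + math.exp(-d / t))
        profile.append(-p * math.log(p) - (1.0 - p) * math.log(1.0 - p))
    return profile
```

If the observed profile is generated by the same toy predictor at one of the grid temperatures, `select_temperature` recovers that temperature, and positions flagged as mostly gaps do not influence the result.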

We study three folds: the immunoglobulin fold (Ig), the
oligonucleotide-binding fold (OB), and the Rossmann fold (R). We
compute the correlation coefficient [18] between the values of
$S_{MF}(k)$, obtained at $T_{sel}$, and $S_{acr}(k)$ for all three
folds. The results are summarized in Table I. The plots of $S_{MF}(k)$
and $S_{acr}(k)$ versus $k$, as well as their scatter plots, are shown
in Figs. 1-3(a,b).
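For reference, the linear (Pearson) correlation coefficient between two entropy profiles can be computed as in the sketch below. The text does not specify which estimator was used, so the Pearson form and the function name are assumptions.

```python
import math

def correlation_coefficient(x, y):
    """Pearson correlation coefficient r between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

Here `x` would hold the predicted entropies at the selected temperature and `y` the observed entropies, restricted to the same set of aligned positions.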

The correlation between $S_{MF}(k)$ and $S_{acr}(k)$ is remarkable for
all three folds and indicates that our mean-field model is able to
select the conserved amino acids in protein fold families. It is fully
expected that the correlation coefficient is smaller than 1, because
the computation of $S_{MF}(k)$ takes into account evolutionary
selection for stability only and does not take into account possible
additional pressure to optimize kinetic or functional properties.



The additional evolutionary pressure due to the kinetic or functional
importance of amino acids results in pronounced deviations of $S_{MF}$
from $S_{acr}$ for a few amino acids that may be kinetically or
functionally important. The amino acids whose conservatism is much
greater than predicted by our model form a group of "outliers" from
the otherwise very close correspondence between $S_{MF}$ and
$S_{acr}$. To demonstrate that some of those amino acids are important
for folding kinetics, and as such can be under additional evolutionary
pressure, we color the data points on the $S_{MF}$ versus $S_{acr}$
scatter plots according to the range of $\phi$-values that the
corresponding amino acids fall into. The thermodynamic and kinetic
roles of individual amino acids were studied extensively (i) by Hamill
et al. [20] for the TNfn3 (1TEN) protein, and (ii) by Lopez-Hernandez
and Serrano [21] for the CheY protein. We use the $\phi$-values for
individual amino acids obtained in [20,21]. We observe that (i) for
the TNfn3 protein most of the points in Fig. 1(b) that belong to the
outlier group have $\phi$-values ranging from 0.2 to 1, and (ii) for
the CheY protein most of the points (for which $\phi$-values are
known) in Fig. 3(b) that belong to the outlier group have
$\phi$-values ranging from 0.3 to 1.
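One simple way to pick out such outliers is to flag positions whose observed entropy falls below the predicted one by more than a chosen margin. The text does not give a numerical criterion for the outlier group, so the threshold and the function name below are purely illustrative assumptions.

```python
def conservation_outliers(s_mf, s_acr, threshold):
    """Positions that are more conserved than predicted by more than threshold."""
    return [
        k
        for k, (predicted, observed) in enumerate(zip(s_mf, s_acr))
        if predicted - observed > threshold
    ]
```

The returned indices are candidate positions for additional kinetic or functional pressure and could then be compared against measured $\phi$-values where those are available.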

To conclude, we presented a theory that explains sequence conservation
caused by the most basic and universal evolutionary pressure in
proteins: the pressure to maintain stability. The theory predicts the
sequence entropy very well for the majority of amino acids, but not
for all of them. The amino acids that exhibit considerably higher
conservatism than predicted from stability pressure alone are likely
to be important for function and/or folding. Comparison of the
"base-level" stability conservatism $S_{MF}(k)$ with $S_{acr}(k)$, the
actual conservatism profile of a protein fold, allows one to identify
functionally and kinetically important amino acid residues and
potentially to gain specific insights into the folding and function of
a protein.



Methods 

Protein model 

We represent interactions in a protein in a $C_\beta$ approximation:
each pair of amino acids interacts if their $C_\beta$ atoms
($C_\alpha$ in the case of Gly) are within the contact distance of
7.5 Å. The total potential energy of the protein can be written as
follows:

$$E = \frac{1}{2} \sum_{i \neq j}^{N} U(\sigma_i, \sigma_j)\,\Delta_{ij}, \qquad (8)$$

where $N$ is the length of the protein, and $\sigma_i$ is the amino
acid type at position $i = 1, \ldots, N$. $U(\sigma_i, \sigma_j)$ is
the corresponding element of the matrix of pairwise interactions
between amino acids $\sigma_i$ and $\sigma_j$. $\Delta_{ij}$ is the
element of the contact matrix, which is defined to be 1 if there is a
contact between amino acids $i$ and $j$, and 0 otherwise.
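The contact approximation can be sketched as follows, assuming one representative atom per residue given as an $(x, y, z)$ tuple in angstroms. Whether sequence neighbors are excluded from the contact matrix is not stated in the text; this sketch keeps them. The interaction values in `U_TOY` and the function names are hypothetical.

```python
import math

CONTACT_DISTANCE = 7.5  # angstroms, the contact distance quoted in the text

def contact_matrix(coords, cutoff=CONTACT_DISTANCE):
    """Symmetric 0/1 matrix: 1 if two residues lie within the cutoff distance."""
    n = len(coords)
    delta = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                delta[i][j] = delta[j][i] = 1
    return delta

def native_energy(seq, delta, U):
    """E = 1/2 * sum over i != j of U(sigma_i, sigma_j) * Delta_ij."""
    n = len(seq)
    return 0.5 * sum(
        U[seq[i]][seq[j]] * delta[i][j]
        for i in range(n)
        for j in range(n)
        if i != j
    )

# Hypothetical toy parameters: only H-H contacts are favorable.
U_TOY = {"H": {"H": -1.0, "P": 0.0}, "P": {"H": 0.0, "P": 0.0}}
```

The factor of 1/2 compensates for counting each contacting pair twice in the double sum.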

Six-letter code potential 

Due to the similarities in the properties of the 20 types of amino
acids, one can classify these amino acids into 6 distinct groups:
aliphatic {AVLIMC}, aromatic {FWYH}, polar {STNQ}, positive {KR},
negative {DE}, and special (reflecting their special conformational
properties) {GP}. We construct the effective potential of interaction,
$U_6(\hat\sigma_i, \hat\sigma_j)$, between the six groups of amino
acids, $\hat\sigma$, by computing the average interaction between
these groups, i.e.

$$U_6(\hat\sigma_i, \hat\sigma_j) = \frac{1}{N_{\hat\sigma_i} N_{\hat\sigma_j}} \sum_{\sigma_k \in \hat\sigma_i,\; \sigma_l \in \hat\sigma_j} U_{20}(\sigma_k, \sigma_l), \qquad (9)$$

where $\sigma$ denotes amino acids in the 20-letter representation,
$U_{20}(\sigma_k, \sigma_l)$ is the 20-letter Miyazawa-Jernigan matrix
of interactions [23], and $\hat\sigma$ denotes amino acids in the
6-letter representation. $N_{\hat\sigma}$ is the number of actual
amino acids of type $\hat\sigma$, e.g. for the aliphatic group,
$N_{\hat\sigma} = 6$.
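The reduction from a 20-letter to a six-letter potential amounts to averaging the 20-letter matrix over all pairs of group members. The sketch below uses the six groups listed above; the 20-letter matrix is passed in as a nested dictionary, and no actual Miyazawa-Jernigan values are included here. The function name and the data layout are assumptions.

```python
GROUPS = {
    "aliphatic": "AVLIMC",
    "aromatic": "FWYH",
    "polar": "STNQ",
    "positive": "KR",
    "negative": "DE",
    "special": "GP",
}

def reduce_potential(u20, groups=GROUPS):
    """Average a 20-letter interaction matrix over the members of each group pair."""
    u6 = {}
    for name_a, members_a in groups.items():
        u6[name_a] = {}
        for name_b, members_b in groups.items():
            total = sum(u20[a][b] for a in members_a for b in members_b)
            u6[name_a][name_b] = total / (len(members_a) * len(members_b))
    return u6
```

Called with a real 20-letter matrix in this layout, `reduce_potential` returns a symmetric 6 x 6 table whenever the input matrix is symmetric.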
TABLE I. The values of the correlation coefficient $r$ for the linear
regression of $S_{MF}(k)$ versus $S_{acr}(k)$ for the Ig, OB, and R
folds, and the corresponding optimal values of the temperature
$T = T_{sel}$.
FIG. 1. (a) The values of $S_{MF}(k)$ (black line) and $S_{acr}(k)$
(red line) for all positions, $k$, for the Ig-fold. The lower the
values of $S_{MF}(k)$, the more conserved the amino acids are at these
positions. (b) The scatter plot of predicted $S_{MF}(k)$ versus
observed $S_{acr}(k)$. The linear regression correlation coefficients
are shown in Table I. The blue line is the linear regression; its
slope differs from 1 (red line, corresponding to the relation
$S_{MF}(k) = S_{acr}(k)$). (c) The histogram of the differences
between $S_{MF}(k)$ and $S_{acr}(k)$. In (b) we assign colors to data
points corresponding to amino acids with a specific range of
$\phi$-values [20]: red if $0.5 < \phi < 1$, yellow if
$0.2 < \phi < 0.5$, magenta if $0.1 < \phi < 0.2$, violet if
$\phi < 0.1$, and black if $\phi$-values are not determined.
FIG. 2. (a)-(c) The same as Fig. 1 but for the OB-fold.
FIG. 3. (a)-(c) The same as Fig. 1 but for the R-fold. In (b) we
assign colors to data points corresponding to amino acids with a
specific range of $\phi$-values [21]: red if $0.3 < \phi < 1$, yellow
if $0.1 < \phi < 0.3$, violet if $\phi < 0.1$, and black if
$\phi$-values are not determined.